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Abstract 

Gaussian processes (GPs) provide a probabilistic nonparametric representation of 
functions in regression, classification, and other problems. Unfortunately, exact 
learning with GPs is intractable for large datasets. A variety of approximate GP 
methods have been proposed that essentially map the large dataset into a small 
set of basis points. The most advanced of these, the variable-sigma GP (VSGP) 
(Walder et al., 2008), allows each basis point to have its own length scale. How- 
ever, VSGP was only derived for regression. We describe how VSGP can be 
applied to classification and other problems, by deriving it as an expectation prop- 
agation algorithm. In this view, sparse GP approximations correspond to a KL- 
projection of the true posterior onto a compact exponential family of GPs. VSGP 
constitutes one such family, and we show how to enlarge this family to get addi- 
tional accuracy. In particular, we show that endowing each basis point with its own 
full covariance matrix provides a significant increase in approximation power. 



1 Introduction 

Gaussian processes (GP) are powerful nonparametric Bayesian approach to modelling unknown 
functions. As such, they can be directly used for classification and regression (Rasmussen & 
Williams, 2006), or embedded into a larger model such as factor analysis (Teh et al., 2005), rela- 
tional learning (Chu et al., 2006), or reinforcement learning (Deisenroth et al., 2009). Unfortunately, 
the cost of GPs can be prohibitive for large datasets. Even for the regression case where the GP pre- 
diction formula is analytic, training the exact GP model with points demands an 0{N^) cost 
for inverting the covariance matrix and predicting a new output requires 0{N'^) cost in addition to 
storing all of the training points. 

Ideally, we would like a compact representation, much smaller than the number of training points, 
of the posterior distribution for the unknown function. This compact representation could be used 

to summarize training data for Bayesian learning, or it could be passed around as a message in order 
to do inference in a larger probabilistic model. One successful approach is to map the training data 
into a small set of basis points, then compute the exact posterior that results from those points. The 
basis points could literally be a subset of the training instances (Csato, 2002; Lawrence et al., 2002) 
or they could be artificial "pseudo-inputs" that represent the training set (Snelson & Ghahramani, 
2006). Furthermore, the GP that is used to for inference on the basis points need not match the 
original GP. In particular, as shown by Walder et al. (2008), the GP applied to the basis points 
should have a longer length scale than the original (since the data is now sparser). Their "variable 
sigma Gaussian process" (VSGP) algorithm (Walder et al., 2008) allows a different length scale for 
each basis point. 
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For classification problems, a further approximation step is needed. The training set needs to not 
only be reduced but also transformed from classification data to regression data. In this paper, 
we show how both of these approximation steps can be accomplished in the same framework of 
expectation propagation (EP). The novel contributions of this paper include: 

• Derive VSGP and other sparse GP algorithms as EP algorithms employing a particular 

choice of approximating family, 

• Extend VSGP to use a full covariance matrix around each basis point, 

• Show how to apply the VSGP approximation to classification and other likelihoods, and 

• Evaluate the benefit of full covariance matrices on classification and regression problems 
(section 5). 

2 Gaussian process regression and classification 

We denote independent and identically distributed samples as V = {(xi, yi), . . . , (x„, yn)}N, 
where x, is a d dimensional input and y, is a scalar output. We assume there is a latent function / 
that we are modeling and the noisy realization of latent function / at x^ is y-i. 

A Gaussian process places a prior distribution over the latent function /. Its projection at the samples 
{xj} defines a joint Gaussian distribution: 

pi{)=Afi{\m°,K) 

where m° = mP{Tii) is the mean function and Kij = K{'x.i,Xj) is the covariance function, which 
encodes the prior notation of smoothness. Normally the mean function is simply set to be zero and 
we follow this tradition in this paper. A typical kernel covariance function is the squared exponential, 
also know as Radial Basis Function (RBF), 

llx' -xlP 

fc(x, x') = exp( ^ ), (I) 

where o- is a hyperparameter. 

For regression, we use a Gaussian likelihood function 

Piyi\f)=^fiy^\f{^^),vy) (2) 

where Vy is the observation noise. For classification, the data likelihood has the form 

piVilf) = (1 - e)0(/(xi)2/i) + e4>i-fi^i)yi) (3) 

where e models the labeling error and <f>{-) is a nonlinear function, ie., a cumulative Gaussian distri- 
bution or a step function such that (p{ f{xi)) = 1 if /(.r;) < and (f>{f{xi)) = otherwise. 

Given the Gaussian process prior over / and the data likelihood, the posterior process is 

N 

pif\V,t)<xGPif\0,K)l[p{yi\f) (4) 

i=l 

Since the Gaussian process is grounded on the N examples, they are called the basis points. 

For the regression problem, the posterior process has an analytical form. But to make a prediction 
on a new sample, we need to invert aNhy N matrix. If the training set is big, this matrix inversion 
will be too costly. For classification or other nonlinear problems, the computational cost is even 
higher since we do not have a analytical solution to the posterior process and the complexity of the 
process grows with the number of training samples. 

3 Adaptive sparse Gaussian processes 

In this section we present the novel sparse Gaussian process model coupled with an efficient infer- 
ence algorithm. 
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3.1 Model 

To save the computational and memory cost, we approximate the exact Gaussian process model (4) 
by a sparse model parameterized by (u, b, c, A): 

M 

q{f) cx GP(/|0, K) n ^{uk\ / /(x)Ar(x|6fc, Cfc)dx, A^^) (5) 
fc=i 

In this sparse model, we can change the number of basis points, M, to regulate the model complexity. 
Normally, we set M <^ N. Importantly, the basis point hk is blurred by the Gaussian distribution 
A/'(x|6fc,Ci;), which can represent how the data distributed locally around hu- The parameter Uk 
represents a virtual regression target for hk, and Afe is its precision. 

The distribution (5) is a Gaussian process, so it has a mean and covariance function. To determine 
these, first consider the following functional: 

Z(m°) = j GPif\m\K)Afiu\gB{f),A-')df (6) 

where u = (ui, . . . , mm), the subscript B denote the basis set (6i, . . . , &m), and 

gB(/)= [|/(x)Ar(x|6i,ci)dx,...,y/(x)AA(x|6M,CM)dx]^ (7) 
A-i =diag(Ai-\A2-\...,A^;) (8) 
Now gs(/) follows a joint Gaussian distribution with the mean ih^ and V^: 

m% = II /(x)AA(x|6,-, c,)GP(/|m°, if )dxd/ = j m°(x)AA(x|6,-, c,)dx (9) 

^Bij = J yAr(x|6„Ci)if(x,x')Ar(x'|6,-,c,)dxdx' (10) 

Therefore, we can compute Z{m°) as follows: 

ZK) = I M{u\sB,A-^)MigB\rh%V%)df=M{u\rh%r^) (11) 

where /3 = (V% + A-^)"!. Define a. = 3u. Then the mean and the covariance functions of the 
sparse model are characterized by a and In particular, we have the following theorem to describe 
the relationship between them. 

Theorem 1 The posterior process q{f) defined in (5) has the mean function m(x) and covariance 
function V{x.,-x'): 

m{x) = K{x,B)a (12) 
y (x, x') = Jr(x, x') - ^(x, B)/3K{B, x') (13) 

where k{i^,B) = [^(x, 6i), . . . , ^(x, . . . , ^(x, 6m)], -^(B,x) = (^(x,B))T and 
k{y.,bj) = /ii'(x,x')AA(x'|&j,Cj)dx'. 

When using the RBF covariance function (1), we have 

k{x, B) = (27rc72)^/2[AA(x|6i, ci + cr^I), . . . , AA(x|&m, cm + a''\)]. (14) 

Proof: First, consider the minimization of the KL divergence between the posterior process 
q{f ) oc GP(/|m", _ft')A/'(u|gB(/), A~^) for the sparse model and a process in the expo- 
nential family. Since q{f) also belongs to the exponential family, this minimization will achieve the 
optimal solution, i.e., q{f) = q{f). 
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Now to obtain the mean function m(x) and tlie covariance function V{tc, x') for q{f), we can solve 
by the KL minimization. This leads to the following moment matching equations: 

m(x) = m°(x) + j (x, a') ^ll^^da' (15) 

F(x,x') = i^(x,x') /^(x,a)^^^^^^|g^i^(a',x')dada' (16) 

Based on (11), it is easy to obtain 

= [AA(x|6i, ci), . . . , 7\A(x|6m, cm)]/?(u - liis) (18) 
Combining (15) and (18) gives 

m(x)=m°(x) + yif(x,x)[A/-(x|&i,ci),...,A/'(x|6M,CMMu-p)dx (19) 

= m°(x) + ^(x, B)/3(u -p)= m°(x) + K(x, B){a - l3m%) (20) 

where K{x, B) is defined in (14). Setting the prior mean function m°(x) = 0, we have ih^ = 0. 
As a result, equation (12) holds. 

From (18) it follows that 

d^logZ ., , .. dp 



[AA(x|6i, ci), . . . ,AA(x|6m, cmW^^zJj::^ (21) 

T 

(22) 



dmO(x)dmO(x') v^i-i, w;, - - • v^i"'" ' -^";j,.^^o(x') 

= -[AA(x|6i, ci), . . . ,AA(x|6m, cm)]/3[AA(x'|&i, ci), . . . , AA(x'|6m, cm)]^ 



Based on the above equation and (16), we have 

y(x, x') = K{yi, x') - X(x, B)f3k{B, x') (23) 

Thus (13) holds. □ 

Blurring the Gaussian process by the local Gaussian distribution J\f{'x.\bi, Ci), we obtain the follow- 
ing corollary: 

Corollary 2 The projection of the blurred Gaussian posterior process q{f) onto B is a Gaussian 
distribution with the following mean and covariance: 

liiB = Ka (24) 
Vb = K - Kl3K^ (25) 

where Kij = J /AA(x|6j, Ci)ii'(x, x')A/'(x'|&j-, Cj)dxdx'. 

Now the remaining question is how to estimate (u, A), or equivalently (a, (3), such that the approx- 
imate posterior distribution on B well approximates the exact posterior. In the following section, we 
describe an inference method for the sparse model. 

3.2 Inference by expectation propagation 

Expectation propagation (Minka, 2001) has three steps, message deletion, data projection, and mes- 
sage updates, iteratively applied to each training point. In the message deletion step, we compute 
the partial belief m^', v^^) by removing a message U (from the i-th point) from the approx- 
imate posterior g°''^(/|m, v). In the data projection step, we minimize the KL divergence between 
Pif) viti] f)<l{f \ rri^'^, '(;\*) and the new approximate posterior q°^^{f\rn, v), such that the infor- 
mation from each data point is incorporated into the model. Finally, the message ii is updated based 
on the new and old posteriors. 
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Based on (5), the sparse GP is an exponential family witli features (gB(/), gB(/)gs(/)^)- As a 
result, we can determine the sparse GP that minimizes K L{p{f)\q{f)) by matching the moments 

on gBU). 

Similar to (15) and (16), the moment matching equations are 

ihB=iii\; + FV(B,x0/^ (26) 



(27) 
(28) 



V.=vV.V-V(i,,x,^AJ|?_v-V(x.,i<) 
where y\'(B,x)j = J Af{x'\bj,Cj)V\'{x' ,x)dx' 
Combining (24) and (26) gives 

b = K-^V\\B, Xi) =pi- p\ik{B, Xi) (30) 

Y>^ = i^-^k{B,x,) (31) 

where we use (13) to obtain the last equation in the second line. 

From (25) we have /3 = (K — Vb) (K~^)^. Inserting (27) into this equation gives 

(dm\'(xj))2 

These equations define the projection update. This update can be equivalently interpreted as multi- 
plying by an approximate factor ti(/) defined as: 

iiU)=^{Y.P^i 1 /W-^W^i.c,)dx|ft,r-i) (33) 

Tr^ = {-Vl\ogZ)-^-K{x,,B)h (34) 

ft = mV(x,) + (-V^logZ)-iV„,logZ (35) 

The approximation factor can be viewed as a message from the i-th data point to the sparse 
GP. To check the validity of this update, we compute 

Z = I Uf)q'^\f)df cxX{u,\pJm);,T-' +pjv);p,) (36) 

= X{u,\m\\x,),T-^ + k{x,,B}h) (37) 

which has the same derivatives as the original Z = J ti{f )q^''{f)df. Therefore, the multiplication 
leads to the same q{f\m, v). In other words, we have oc q{f)/q^^{f)- 

To delete this message, we multiple its reciprocal with the current <?(/). Using the same trick as 
before, we can solve the multiplication using the following moment matching equations: 

h = p, -/3K(B,x,) (38) 



Zd = I ^a\X/)d/ « AA(M,|pTm\;, -T-' + if(x„ B)h) 



(39) 
(40) 



= a + h- — — - 
dm(xi) 

(dm(xi))2 



(42) 
(43) 
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The inference algorithm is summarized in Algorithm 1 . 



1. Initialize q{f), (ji, and all to be 0. 

2. Loop until the change over all gi, and Tj is smaller 
than a threshold 

Loop over all training data point 
Deletion. Compute and /3\' for via (42)(43) 

Projection. Compute a and B for q(/) via (29)(32) 
Inclusion. Update g.^, and for the message U via (34)(35) 

Algorithm 1: Expectation Propagation Inference for Sparse GP 



3.3 Regression 

Given the linear regression likelihood (2), the quantities in the projection step (29)(32) are 
dlogZ yi — m\^{xi) d^logZ —1 

dm(xj) Vy + v\^{xi,Xi) (dm(xj))2 Vy + v\'-{xi,Xi) 

3.4 Classification 



(44) 



Given the classification likelihood (3) where 4>{-) is the step function, the quantities in the projection 
step (29)(32) are 

. = (45) 

Z = e+{1- 2e)iP{z) (46) 
dlogZ 



IVi (47) 

(48) 



dm(xj 

d^logZ _J{m^^{x^)yi + v\'{xi,Xi)J) 



(dm(xi))2 v\%Xi,Xi) 
where 7 = ^nd ) is the standard Gaussian cumulative distribution function. 

4 Related work 



Most work in sparse GP approximation has been for regression with Gaussian noise, where the exact 
posterior distribution is a GP whose moments are available in closed form. The goal in that case is to 
approximate the exact GP with a simpler GP. Quinonero-Candela and Rasmussen (2005) compared 
several such approximations within a common framework, highlighting the FITC approximation as 
an improvement over DTC. Walder et al. (2008) later generalized FITC to include basis-dependent 
lengthscales, and showed that this approximation fits into the framework of Quinonero-Candela and 
Rasmussen (2005). 

For classification with GPs, few papers have imposed an explicit sparsity constraint on the GP pos- 
terior. One reason for this is that sparse GP approximations have only been understood from a 
regression perspective. Csato and Opper (2000) showed how online GP classification could be made 
sparse, using an approximation equivalent to FITC. However, the FITC approximation was intro- 
duced as a subroutine, not presented as a part of EP itself. Csato (2002) later gave a batch EP 
algorithm, where a different sparse GP approximation (DTC) was used as a subroutine within EP. 
The software distributed by Csato can run either online or batch and has an option to use either FITC 
or DTC. In this paper, we clarify Csato and Opper's work by showing that the FITC approximation 
can in fact be viewed as part of the overall EP approximation. Furthermore we extend it to include 
VSGP approximation. 
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5 Experiments 



We evaluate the new sparse GP method on both synthetic and real world data for regression and 
classification tasks and compare its predictive performance with alternative sparse GP methods. We 
use the RBF kernels for all the experiments. To all the sparse GP models, we need to choose basis 
centers. Instead of using the evidence framework, we use the entropy reduction criteria (equations 
are omitted to save space) to search over the hierarchical partitions of the data. We then use the 
cluster centers as the basis points and use the data partitions to compute the local covariance matrix. 
Our focus is on the approximate inference method so we use the same basis points for all the methods 
being evaluated. 

We first examine the performance of the new method on synthetic data for the regression task. Since 
we can control the generative process of the synthetic data, it is easier for us to gain insight into how 
the method performs. For regression, we sample 100 data points along a circle with some additive 
Gaussian noise. The output in different quadrant has different values plus certain additive Gaussian 
noise. 
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(c). VSGP (d). New 

Figure 1 : Illustration on simple circle data. The heatmaps represent the values of the posterior means 
of different methods. Red ellipses and crosses are the mean and the standard deviation of local 
covariances for the sparse GP model. The white dots are the training data points. EP estimation 
uses the full local covariance matrix in (d), significantly improving the estimation accuracy along 
the circle. 



The mean of the exact posterior distribution of the full GP is shown in figure 1 .a. The posterior 
means of FITC, SOGP, and the new method with sphere and full local covariance matrices. Four 
basis points are chosen as the centers learned by K-means clustering. Note that when the local 
covariance matrices become sphere matrices, our method reduces to the multiscale method (Walder 
et al., 2008). For FITC, we performed cross-validation to choose the right kernel width for the RBF 
kernel. As shown in the figure, the use of the full local covariance matrix improves the estimation 
accuracy. 

Then we perform test FITC, SOGP, and the new methods with sphere and full local covariance 
matrices on a subset of the pumadyn-32nm dataset. On this dataset, however, we do not see the 
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(a). Full GP 
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(c). VSGP (d). New 

Figure 2: Classification on synthetic data. The blue and red ellipses show the standard deviation of 
local covariances for the sparse GP model. The black curve is the decision boundary. With only 
three basis points, the true, complex decision boundary in (a) is well approximated by an ellipse by 
our method (d). 




further improvement of our new method over FITC. This is consistent with the results reported 
in (Walder et al., 2008). 
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We then test all the methods on synthetic data for a classification task. Each class of the data is 
sampled from a Gaussian and shown in 3. It also shows the decisions boundaries and the basis 
points. Figure 3 illustrates the advantages of our new approach. Since we simply use K-means to 
choose the basis points, further improvement for FITCis possible. We then sample 200 points for 
training and 2000 for testing. The experiments are repeated 10 times and the results are shown in 
figure 3. In this figure, we also report results of SOGP. 

Finally we tested our methods on a standard UCI dataset. Ionosphere. The results are summarized 
in figure 3. Again, we outperform the previous methods. 

6 Conclusions 

In this paper, we present a new perspective to understand sparse GP models using the expectation 
propagation framework and develop new inference methods based on this perspective. Empirical 
results demonstrate improved prediction accuracy with the new extensions. 
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